ABSTRACT

The  philosophy  of  robust  procedures  is  discussed.  It  is 
argued  that  the  present  emphasis  by  statistical  researchers 
on  ad  hoc  methods  of  robust  estimation  is  mistaken.  Classical 
methods  of  estimation  should  be  retained  using  models  which 
more  appropriately  represent  reality.  Attention  should  not  be 
confined  merely  to  discrepancies  arising  from  outliers  and 
heavy  tailed  distributions  but  should  be  extended  to  include 
serial  dependence,  need  for  transformation  and  other  problems. 
Some  researches  of  this  kind  using  Bayes  theorem  are  discussed. 
SIGNIFICANCE AND EXPLANATION

In recent years attempts have been made to render the process of statistical estimation of model parameters less sensitive to violation of assumptions such as exact normality of the error distribution.

The current methods of "robustification" are largely empirical and rely on ad hoc modification of classical estimation methods such as maximum likelihood. In this paper it is argued that the process of robustification is best carried out, not by modifying the estimation method, but by suitably elaborating the model and then using classical estimation procedures.

This approach has the advantage of making explicit what is being assumed and also of generality. Applications are made by obtaining estimates which are robust with respect to non-normality, serial correlation, transformation and the existence of outliers.




The responsibility for the wording and views expressed in this descriptive summary lies with MRC, and not with the author of this report.


ROBUSTNESS  IN  THE  STRATEGY  OF  SCIENTIFIC 
MODEL  BUILDING* 

G.  E.  P.  Box 


Robustness  may  be  defined  as  the  property  of  a procedure 
which  renders  the  answers  it  gives  insensitive  to  departures, 
of  a kind  which  occur  in  practice,  from  ideal  assumptions. 
Since  assumptions  imply  some  kind  of  scientific  model,  I 
believe  that  it  is  necessary  to  look  at  the  process  of 
scientific  modelling  itself  to  understand  the  nature  of  and 
the  need  for  robust  procedures.  Against  such  a view  it  might 
be  urged  that  some  useful  robust  procedures  have  been  derived 
empirically without an explicitly stated model. However, an empirical procedure implies some unstated model and there is often great virtue in bringing into the open the kind of assumptions that lead to useful methods.* The need for robust methods seems to be intimately mixed up with the need for simple models. This we now discuss.


A paper  read  at  the  Army  Research  Office  Workshop  on  Robustness 
in  Statistics  held  at  Research  Triangle  Park,  North  Carolina 
on  April  11-12,  1978. 

* An example [1], [2] was the application in the 1950's of
exponential  smoothing  for  business  forecasting  and  the  wide 
adoption  in  this  century  of  three-term  controllers  for 
process  control.  It  was  later  realized  that  these  essentially 
empirical  procedures  point  to  the  usefulness  of  ARIMA  time 
series  models  since  both  are  optimal  for  disturbances 
generated  by  such  models. 


Sponsored by the United States Army under Contract No. DAAG29-75-C-0024.


THE NEED FOR SIMPLE SCIENTIFIC MODELS - PARSIMONY


The  scientist,  studying  some  physical  or  biological 
system  and  confronted  with  numerous  data,  typically  seeks  for 
a model  in  terms  of  which  the  underlying  characteristics  of 
the  system  may  be  expressed  simply. 

For example, he might consider a model of the form

$$y_u = f(\xi_u, \theta) + e_u \qquad (u = 1, 2, \ldots, n) \qquad (1)$$

in which the expected value $\eta_u$ of a measured output $y_u$ is represented as some function of k inputs $\xi$ and of p parameters $\theta$, and $e_u$ is an "error". One important measure of simplicity of such a model is the number of parameters that it contains. When this number is small, we say the model is parsimonious.

Parsimony is desirable because (i) when important aspects of the truth are simple, simplicity illuminates, and complication obscures; (ii) parsimony is typically rewarded by increased precision (see Appendix 1); (iii) indiscriminate model elaboration is in any case not a practical option because this road is endless.*
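Point (ii) can be illustrated numerically. The sketch below assumes the standard least-squares result that, for a linear model with p parameters fitted to n observations of variance sigma^2, the average variance of the fitted values is p*sigma^2/n; whether this is exactly the argument of Appendix 1 cannot be confirmed from this section. The function names and the simulated design are hypothetical.

```python
import random
import statistics

def fit_mean_only(x, y):
    """One-parameter model y = b0: every fitted value is the sample mean."""
    m = statistics.fmean(y)
    return [m] * len(y)

def fit_line(x, y):
    """Two-parameter model y = b0 + b1*x fitted by ordinary least squares."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return [ybar + b1 * (xi - xbar) for xi in x]

def average_variance_of_fit(fit, x, n_sim=4000, sigma=1.0, seed=1):
    """Average over the design points of Var(fitted value).

    The simulated truth is a constant mean of zero, so both models are
    unbiased and the mean square of a fitted value estimates its variance.
    """
    rng = random.Random(seed)
    n = len(x)
    sums = [0.0] * n
    for _ in range(n_sim):
        y = [rng.gauss(0.0, sigma) for _ in range(n)]
        for i, fitted in enumerate(fit(x, y)):
            sums[i] += fitted * fitted
    return sum(s / n_sim for s in sums) / n

x = [float(u) for u in range(10)]
v_mean = average_variance_of_fit(fit_mean_only, x)  # theory: 1 * sigma^2 / 10
v_line = average_variance_of_fit(fit_line, x)       # theory: 2 * sigma^2 / 10
print(round(v_mean, 3), round(v_line, 3))
```

When the extra slope parameter is not needed, estimating it roughly doubles the average variance of the fitted values in this setting.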


ALL  MODELS  ARE  WRONG  BUT  SOME  ARE  USEFUL 

Now  it  would  be  very  remarkable  if  any  system  existing 
in  the  real  world  could  be  exactly  represented  by  any  simple 
model.  However,  cunningly  chosen  parsimonious  models  often  do 


* Suppose for example that in advance of any data we postulated a model of the form of (1) with the usual normal assumptions. Then it might be objected that the distribution of $e_u$ might turn out to be heavy-tailed. In principle this difficulty could be allowed for by replacing the normal distribution by a suitable family of distributions showing varying degrees of kurtosis. But now it might be objected that the distribution might be skew. Again, at the expense of further parameters to be estimated, we could again elaborate the class of distribution considered. But now the possibility might be raised that the errors could be serially correlated. We might attempt to deal with this employing, say, a first order autoregressive error model. However, it could then be argued that it should be second order or that a model of some other type ought to be employed. Obviously these possibilities are extensive, but they are not the only ones: the adequacy of the form of the function $f(\xi, \theta)$ could be called into question and elaborated in endless ways; the choice of input variables $\xi$ might be doubted and so on.


provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an "ideal" gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.

For  such  a model  there  is  no  need  to  ask  the  question 
"Is  the  model  true?".  If  "truth"  is  to  be  the  "whole  truth" 
the  answer  must  be  "No".  The  only  question  of  interest  is 
"is  the  model  illuminating  and  useful?". 

ITERATIVE PROCESS OF MODEL BUILDING

How then is the model builder to know what aspects to include and what to omit so that parsimonious models that are illuminating and useful result from the model building process? We have seen that it is fruitless to attempt to allow for all contingencies in advance, so in practice model building must be accomplished by iteration,* the inferential stage of which is illustrated in Figure 1.

[Figure 1: Tentative model -> Conditional inference -> Criticism using residual diagnostic checks -> (revised) Tentative model]
* The iterative building process for scientific models can take place over a short or long period of time, and can involve one investigator or many. One interesting example is the process of discovery of the structure of DNA described by Watson [..]. Another is the development by R. A. Fisher of the theory of experimental design between 19[..] and [..]. The recognition that scientific model building is an iterative process goes back to such classical authors as Aristotle, Grosseteste and Bacon. The suggestion that statistical procedures ought to be viewed in this iterative context was discussed for example in [..], [5], [6], [7], [22].


[...] study, may suggest a first model worthy to be tentatively entertained. From this model a corresponding first tentative analysis may be made as if we believed it. The tentative inferences made, like all inferences, are conditional on the applicability of the implied model, but the investigator now quickly switches his attitude from that of sponsor to that of critic. In fact the tentative analysis provides a basis for criticism of the model. This criticism phase is accomplished by examining residual quantities using graphical procedures and sometimes more formal tests of fit. Such diagnostic checks may fail to show cause for doubting the model's applicability; otherwise they may point to modification of the model, leading to a new tentative model and, in turn, to a further cycle of the iteration.

WAYS TO ALLOW FOR MODEL DISCREPANCIES

How can we avoid the possibility that the parsimonious models we build by such an iteration might be misleading?

There are two answers.

a) Knowing the scientific context of an investigation we can allow in advance for more important contingencies.

b) Suitable analysis of residuals can lead to our fixing up the model in other needed directions.

We call the first course model robustification, the second iterative fixing.

JUDICIOUS MODEL ROBUSTIFICATION

Experience with data and known vulnerabilities of statistical procedures in a specific scientific context will alert the sensitive practitioner to likely discrepancies that can cause problems. He may then judiciously and grudgingly elaborate the model and hence the resulting procedure so as to insure against particular hazards in the most parsimonious manner.* Models providing for simple forms of



* It is currently fashionable to conduct robustness studies in which the normality assumption is relaxed (in favor of heavy tailed distributions or distributions containing outliers) but all other assumptions are retained. Thus it is still assumed that errors are independent, that transformations are correctly specified and so on. This seems to be too naive and narrow a view.


autocorrelation  in  serial  data,  for  simple  transformation  in 
data  covering  wide  ranges,  for  outliers  in  the  almost 
universal  situation  where  perfect  control  of  the  experimental 
process  is  not  available,  are  all  examples  of  commonly  needed 
parsimonious  elaborations  which  can  have  major  consequences. 

ITERATIVE FIXING USING DIAGNOSTIC CHECKS

Once  it  is  recognized  that  the  choice  of  model  is  not 
an  irrevocable  decision,  the  investigator  need  not  attempt  to 
allow  for  all  contingencies  a priori  which  as  we  have  said  is 
in  any  case  impossible.  Instead,  after  appropriate  robus- 
tification,  he  may  look  at  residual  quantities  in  an  attempt 
to  reveal  discrepancies  not  already  provided  for. 

To better appreciate such a process of iterative fixing, write the model (1) in the form

$$y_u = f(\xi_{1u}) + e(\xi_{2u}) \qquad (2)$$

where now the vector $\xi_{1u}$, previously denoted in (1) by $\xi_u$, represents those variables the investigator has specifically decided to study. The expression $e(\xi_{2u})$, which replaces $e_u$, indicates explicitly that the error $e_u$ represents the joint influences on the output of all those other input variables $\xi_{2u}$ which are omitted from the model (usually because they are unknown). Many statistical procedures (in particular quality control, residual analysis and evolutionary operation) are concerned with discovering "assignable causes" - elements of $\xi_{2u}$ which may be moved out of the unknown to the known as indicated by

$$y_u = f(\xi_{1u}) + e(\xi_{2u}), \qquad \xi_{1u} \longleftarrow \xi_{2u}. \qquad (3)$$

Now let $\{a_t\}$ be a white noise sequence, that is, a sequence of identically and independently distributed random variables having zero mean. If we now denote the n values of response and known inputs by $y$ and $\xi_1$ respectively then an ideal model

$$F_t(y, \xi_1) = a_t, \qquad t = 1, 2, \ldots, n \qquad (4)$$

would consist of a transformation of the data to white noise which was statistically independent of any other input.

Iteration towards such a model is partially motivated by diagnostic checking through examination of residuals at each stage of the evolving model.

Thus, patterns of residuals can indicate the presence of outliers or suggest the necessity for including specific new inputs. Also serial correlation of residuals, shown for example by plotting $\hat{a}_{t+1}$ versus $\hat{a}_t$ (or more formally by examining estimates of sample autocorrelations for low lags k), can point to the need for allowing for serial correlation in the original noise model. Again, as was shown by Tukey [8], dependence of $y - \hat{y}$ on $\hat{y}$ can indicate the need for transformation of the output. Examination of residuals at each stage parallels the chemical examination of the filtrate from an extraction process. When we can no longer discover any information in the residuals then we can conclude that all extractable information is in the model.*
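As a rough illustration of the autocorrelation check described above, the following sketch fits a straight line by least squares and computes the lag-1 sample autocorrelation of the residuals. The straight-line model, the simulated data and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import random
import statistics

def ols_residuals(x, y):
    """Residuals from a least-squares straight-line fit of y on x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    return [b - (ybar + b1 * (a - xbar)) for a, b in zip(x, y)]

def autocorr(res, k):
    """Sample autocorrelation of the residuals at lag k."""
    m = statistics.fmean(res)
    c0 = sum((r - m) ** 2 for r in res)
    ck = sum((res[t] - m) * (res[t + k] - m) for t in range(len(res) - k))
    return ck / c0

def simulate(phi, n=200, seed=0):
    """Straight line plus first-order autoregressive noise (phi = 0 gives white noise)."""
    rng = random.Random(seed)
    x = [float(u) for u in range(n)]
    e, y = 0.0, []
    for u in x:
        e = phi * e + rng.gauss(0.0, 1.0)
        y.append(2.0 + 0.05 * u + e)
    return x, y

r1_white = autocorr(ols_residuals(*simulate(0.0)), 1)
r1_ar = autocorr(ols_residuals(*simulate(0.8)), 1)
print(round(r1_white, 2), round(r1_ar, 2))
```

A lag-1 value far outside roughly plus or minus 2/sqrt(n) would point to serial correlation that the noise model should allow for.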

MODEL ROBUSTIFICATION AND DIAGNOSTIC CHECKING

Robustification and iterative fixing following diagnostic checking of residuals are of course not rival but complementary techniques and we must try to see how to use both wisely.

Subject matter knowledge will often suggest the need for robustification by parsimonious model elaboration. For example, when models such as (2) are used in economics and business the output and input variables $\{y_u\}$, $\{\xi_{1u}\}$ are often collected serially as time series. They are then very likely to be autocorrelated. If this is so then the

* It should be remembered that just as the Declaration of Independence promises the pursuit of happiness rather than happiness itself, so the iterative scientific model building process offers only the pursuit of the perfect model. For even when we feel we have carried the model building process to a conclusion some new initiative may make further improvement possible. Fortunately to be useful a model does not have to be perfect.

In particular notice that, even though residuals from some model are consistent with a white noise error, this does not bar further model improvement. For example, this white noise error could depend on (theoretically, even, be proportional to) the white noise component of some, so far unrecognized, input variable.


components $\{\xi_{2u}\}$ in the error $e(\xi_{2u})$ are equally likely to have this characteristic. It makes little sense, in this context, therefore, to postulate even tentatively a model of the form of (1) in which the $e_u$ are supposed independent. Instead the representation of $e_u$ by a simple time series model (for example a first order autoregressive process for which serial correlation falls off exponentially with lag) would provide a much more plausible starting place. Failure either to robustify the model in this way initially, or to check for serial correlation in residuals, resulted in one published example in t values (measuring the significance of regression coefficients) which were inflated by an order of magnitude [9,10]. We discuss this example in more detail later.
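For concreteness, the first order autoregressive error model mentioned above can be written out as follows. This is the standard textbook form; the symbol $\phi$ for the autoregressive parameter is an assumption introduced here, and $a_u$ denotes white noise as in the rest of the paper.

```latex
% First-order autoregressive error model (standard form; \phi is an assumed symbol)
e_u = \phi\, e_{u-1} + a_u , \qquad |\phi| < 1 ,
% with white noise a_u. Its autocorrelation at lag k is
\rho_k = \operatorname{corr}(e_u, e_{u-k}) = \phi^{k} , \qquad k = 0, 1, 2, \ldots
% so the serial correlation falls off exponentially with lag,
% and \phi = 0 recovers the independent errors of model (1).
```

Only one extra parameter is spent, which is what makes this a parsimonious elaboration.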

Again statistical analysis in an inappropriate metric can lead to wasteful inefficiency. For instance, textile data are presented in [11] where appropriate transformation would have resulted in a three fold decrease in the relative variance accompanied by reduction in the number of needed parameters from ten to four, for the expenditure of only one estimated transformation parameter.

In both examples discussed above a profound improvement in statistical analysis is made possible by suitable robustification of the model, the need for which could have been detected by suitable diagnostic checks on residuals.

AVOIDANCE OF UNDETECTED MISSPECIFICATION

Unfortunately we cannot always rely on diagnostic checks to reveal serious model misspecification. The dangerous situation is that where initial model misspecification can result in a seriously misleading analysis whose inappropriateness is unlikely to be detected by diagnostic checks.

For example, the widely used model formulation (1) presupposes its applicability for every observation $y_u$ (u = 1,2,...,n) and so explicitly excludes the possibility of outliers. If such a model is (inappropriately) assumed in the common situations where occasional accidents possibly leading to outliers are to be expected, then any sensible method of estimation such as maximum likelihood, if applied using this inappropriate model, must tend to conceal model inadequacy. This is because, in order to follow the mathematical instructions presented, it must choose parameters which make residuals, even with this wrong model, look as much as possible like white noise. That this has led some investigators to abandon standard inferential methods rather than the misleading model seems perverse.

As a further example of the hazard of undetected misspecification consider scientific problems requiring the comparison of variances. Using standard normal assumptions the investigator might be led to conduct an analysis based on Bartlett's test. However this procedure is known to be so sensitive to kurtosis that nonnormality unlikely to be detected by diagnostic checks could seriously invalidate results. This characteristic of the test is well known, of course, and it has long been recognized that the wise researcher should robustify initially. That is, he should use a robust alternative to Bartlett's test ab initio rather than relying on a test of nonnormality followed by possible fix up.
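The reason variance comparisons are so exposed can be seen in a small simulation. The sketch below does not run Bartlett's test itself; it only compares how the sampling variability of the sample mean and of the sample variance respond when a normal parent is replaced by a heavier-tailed one of the same variance. The choice of the Laplace (double exponential) distribution, the sample size and all names are assumptions made for illustration.

```python
import math
import random

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def draw_laplace(rng):
    # Laplace with unit variance: scale b = 1/sqrt(2), i.e. exponential rate sqrt(2)
    magnitude = rng.expovariate(math.sqrt(2.0))
    return magnitude if rng.random() < 0.5 else -magnitude

def variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def sampling_variances(draw, n=20, n_sim=5000, seed=3):
    """Return (variance of the sample mean, variance of the sample variance)."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(n_sim):
        sample = [draw(rng) for _ in range(n)]
        means.append(sum(sample) / n)
        variances.append(variance(sample))
    return variance(means), variance(variances)

mean_norm, var_norm = sampling_variances(draw_normal)
mean_lap, var_lap = sampling_variances(draw_laplace)
ratio_mean = mean_lap / mean_norm  # near 1: the mean hardly notices the kurtosis
ratio_var = var_lap / var_norm     # well above 1: the variance notices it strongly
print(round(ratio_mean, 2), round(ratio_var, 2))
```

A test that takes the normal-theory variability of sample variances for granted will therefore misstate its significance level under such a parent, while a comparison of means is much less affected.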

The conclusion is that the role of model robustification is to take care of likely discrepancies that have dangerous consequences and are difficult to detect by diagnostic checks. This implies an ability by statisticians to worry about the right things. Unfortunately they have not always demonstrated this talent; see for example Appendix 2.

ROBUSTNESS AND ERROR TRANSMISSION

Since  we  need  parsimonious  models  but  we  know  they  must 
be  false  we  are  led  to  consider  how  much  deviation  from  the 
model  of  a kind  typically  met  in  practice  will  affect  the 
procedure  derived  on  the  assumption  that  a model  is  exact. 

The  problem  is  analogous  to  the  classical  problem  of 
error  transmission.  In  its  simplest  manifestation  that 
problem  can  be  expressed  as  follows: 

Consider a calibration function

$$y = f(\beta) \qquad (5)$$

which is used to determine $y$ at some value, say $\beta = \beta_0$. Suppose that the function is mistakenly evaluated at some other value of $\beta$; then the resulting error $\varepsilon$ transmitted into $y$ is

$$(y_0 + \varepsilon) - y_0 = f(\beta) - f(\beta_0) \approx (\beta - \beta_0) \times \rho \qquad (6)$$

where $\rho = \left. \partial y / \partial \beta \right|_{\beta = \beta_0}$.

The expression for the transmitted error $\varepsilon$ contains two factors, $\beta - \beta_0$ and $\rho$. The first is the size of the input error; the second, which we will call the specific transmission, is the rate of change of $y$ as $\beta$ is changed. It is frequently emphasized in discussing error transmission that both factors are important. In particular the existence of a large discrepancy $\beta - \beta_0$ does not lead to a large transmitted error $\varepsilon$ if $\rho$ is small. Conversely even a small error $\beta - \beta_0$ can produce a large error $\varepsilon$ if $\rho$ is large.
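A small numerical sketch of (6) may help. The two calibration functions below are hypothetical, chosen only so that one has a large and the other a small specific transmission, and the derivative is approximated by a central difference.

```python
def transmitted_error(f, beta, beta0):
    """Exact error transmitted into y when f is evaluated at beta instead of beta0."""
    return f(beta) - f(beta0)

def specific_transmission(f, beta0, h=1e-6):
    """Central-difference approximation to dy/dbeta at beta0."""
    return (f(beta0 + h) - f(beta0 - h)) / (2.0 * h)

def steep(beta):
    return beta ** 2      # hypothetical calibration with a large slope near beta0 = 3

def flat(beta):
    return 0.01 * beta    # hypothetical calibration with a small slope everywhere

beta0 = 3.0

# Small discrepancy, large specific transmission
rho_steep = specific_transmission(steep, beta0)        # about 6
exact_steep = transmitted_error(steep, 3.1, beta0)     # about 0.61
approx_steep = (3.1 - beta0) * rho_steep               # about 0.60, linear approximation (6)

# Large discrepancy, small specific transmission
rho_flat = specific_transmission(flat, beta0)          # about 0.01
exact_flat = transmitted_error(flat, 8.0, beta0)       # about 0.05

print(exact_steep, approx_steep, exact_flat)
```

The discrepancy of 0.1 through the steep function transmits a larger error than the discrepancy of 5 through the flat one, which is the point of considering both factors together.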

Now consider a distribution of errors $p(\beta)$. Knowledge of the relation $y = f(\beta)$ allows us to deduce the corresponding distribution $p(\varepsilon)$. In particular if the approximation (6) may be employed then $\sigma_\varepsilon = \rho \sigma_\beta$. The relevance of the above to robustness studies is as follows. Suppose $y$ is some performance characteristic of a statistical procedure which it is desired to study. This characteristic might be some measure of closeness of an estimate to the true value, significance level, the length of a confidence interval, a critical probability, a posterior probability distribution or a rate of convergence of some measure of efficiency or optimality. Also suppose $\beta$ is some measure of departure from assumption such as a measure of nonnormal kurtosis or skewness or autocorrelation of the error distribution and suppose that $\beta = \beta_0$ is the value taken on standard assumptions. Then in the error transmission problem three features of importance are

(i) The distribution of $\beta$. This provides the probability distribution of deviations from assumption which are actually encountered in the real world. Notice this feature has nothing to do with mathematical derivation or with the statistical procedure used.

(ii) The specific transmission $\rho$. This is concerned with the error transmission characteristics of the statistical procedure actually employed and may be studied mathematically. It is well known that different statistical procedures can have widely different $\rho$'s. An example already quoted is the extreme sensitivity to distribution kurtosis of the significance level of likelihood ratio tests to compare variances (Bartlett's test) and the comparative insensitivity of corresponding tests to compare means (analysis of variance tests).

(iii) If the data set is of sufficient size it can itself provide information about the discrepancy $\beta - \beta_0$ which occurs in that particular sample, thus reducing reliance on prior knowledge.* Conversely if the sample size is small or if $\beta$ is of such a nature that a very large sample is needed to gain even an approximate idea of its value, heavier reliance must be placed on prior knowledge (whether explicitly admitted or not).

It seems to me that these three characteristics taken together determine what we should worry about. They are all incorporated precisely and appropriately in a Bayes formulation.

BAYES THEOREM AS A MEANS OF STUDYING ROBUSTNESS

From a Bayesian point of view, given data $y$ all valid inferences about parameters $\theta$ can be made from an appropriate posterior distribution $p(\theta \mid y)$. To study the robustness of such inferences when discrepancies $\beta$ from assumptions occur we can proceed as follows:

Consider a naive model relating data $y$ and parameters $\theta$. For example, it might assume that $p(y \mid \theta)$ was a spherically normal density function, that $E(y)$ was linear in the parameters $\theta$ and that before the data became available the desired state of ignorance about unknown parameters was expressed by suitable non-informative prior distributions leading to the standard analysis of variance and regression procedures. Suppose it was feared that certain discrepancies


* However even the small amount of information about $\beta$ available from a small sample can be important. See for instance the analysis of Darwin's data which follows (Example 1).

[...] might occur (for example lack of independence, need for transformation, existence of outliers, non-normal kurtosis etc.). Two questions of interest are (A) how sensitive are inferences made about $\theta$ to these contemplated misspecifications of the model? (B) If necessary how may such inferences be made robust against such discrepancies as actually occur in practice?

QUESTION (A) SENSITIVITY

Suppose in all cases that discrepancies are parameterized by $\beta$. Also suppose the density function for $y$ given $\theta$ and $\beta$ is $p(y \mid \theta, \beta)$ and that $p(\theta \mid \beta)$ is a non-informative prior for $\theta$ given $\beta$. Then comprehensive inferences about $\theta$ given $\beta$ and $y$ may be made in terms of the posterior distribution

$$p(\theta \mid \beta, y) = k\, p(y \mid \theta, \beta)\, p(\theta \mid \beta) \qquad (7)$$

where k is a normalizing constant. Sensitivity of such inferences to changes in $\beta$ may therefore be judged by inspection of $p(\theta \mid \beta, y)$ for various values of $\beta$.

QUESTION (B) ROBUSTIFICATION

Suppose now that we introduce a prior density $p(\beta)$ which approximates the probability of occurrence of $\beta$ in the real world. Then we can obtain $p(\beta \mid y)$ from $\int p(\theta, \beta \mid y)\, d\theta$. This is the posterior distribution of $\beta$ given the prior $p(\beta)$ and given the data. Then

$$p(\theta \mid y) = \int p(\theta \mid \beta, y)\, p(\beta \mid y)\, d\beta \qquad (8)$$

from which (robust) inferences may be made about $\theta$ independently of $\beta$ as required.*

Inferences are best made by considering the whole posterior distribution; however, if point estimates are needed they can of course be obtained by considering suitable features of the posterior distribution. For example the posterior mean


* More generally the density function for $y$ will contain nuisance parameters $\sigma$. Equations (7) and (8) will then apply after these parameters have been eliminated by integration.


will minimize squared error loss. Other features of the posterior distribution will provide estimates for other loss functions (see for example [6], [12]).

It does seem to me that the inclusion of a prior distribution is essential in the formulation of robust problems. For example, the reason that robustifiers favour measures of location alternative to the sample average is surely because they have a prior belief that real error distributions may not be normal but may have heavy tails and/or may contain outliers. They evidence that belief covertly by the kind of methods and functions that they favour which place less weight on extreme observations. I think it healthier to bring such beliefs into the light of day where they can be critically examined, compared with reality, and, where necessary, changed. Some examples of this alternative approach are now given.


EXAMPLE  1:  KURTOSIS  AND  THE  PAIRED  t TEST 


This section follows the discussion by Box and Tiao [6], [13], [14] of Darwin's data quoted by Fisher on the differences in heights of 15 pairs of self and cross-fertilized plants. These differences are indicated by the dots in Figure 2. The curve labeled $\beta = 0$ in that diagram is a t distribution centered at the average $\bar{y} = 20.93$ with scale factor $s/\sqrt{n} = 9.75$. On standard normal assumptions it can be interpreted as a confidence distribution, a fiducial distribution or a posterior distribution of the mean difference $\theta$. From the Bayesian view point the distribution can be written


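$$p(\theta \mid y) = \text{const} \left\{ 1 + \frac{n(\theta - \bar{y})^2}{(n-1)s^2} \right\}^{-n/2} \qquad (9)$$

and results from taking a non-informative prior distribution for the mean $\theta$ and the standard deviation $\sigma$. Alternatively we may write the distribution (9) in the form

$$p(\theta \mid y) = \text{const} \left[ \sum (y_i - \theta)^2 \right]^{-n/2} \qquad (10)$$

and if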


Figure 2. Posterior distributions $p(\theta \mid \beta, y)$ of mean difference $\theta$ for parent distributions having differing amounts of kurtosis parameterized by $\beta$. Darwin's data.


$$M(\theta, q) = \sum |y_i - \theta|^q, \qquad q > 1$$

(10) may be written as

$$p(\theta \mid y) = \text{const} \left\{ M(\theta, 2) \right\}^{-n/2}. \qquad (11)$$

SENSITIVITY  TO  KURTOSIS 

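One way to consider discrepancies arising from non-normal kurtosis is to extend the class of density functions, using the exponential power family

$$p(y \mid \theta, \sigma, \beta) \propto \sigma^{-1} \exp\left\{ -c(\beta) \left| \frac{y - \theta}{\sigma} \right|^{2/(1+\beta)} \right\} \qquad (12)$$

where

$$c(\beta) = \left\{ \frac{\Gamma\left[\tfrac{3}{2}(1+\beta)\right]}{\Gamma\left[\tfrac{1}{2}(1+\beta)\right]} \right\}^{1/(1+\beta)}$$

and $\theta$ and $\sigma$ are the mean and standard deviation as before. Then using the same noninformative prior distribution as before, $p(\theta, \sigma \mid \beta) \propto \sigma^{-1}$, it is easily shown that in general

$$p(\theta \mid \beta, y) = \text{const} \left\{ M\left(\theta, 2/(1+\beta)\right) \right\}^{-\frac{1}{2} n (1+\beta)} \qquad (13)$$

and in particular if $\beta = 0$, (13) and (11) are identical.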
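Equation (13) is simple enough to evaluate directly on a grid. The sketch below does so for a few values of $\beta$. The 15 differences are not listed in this section; the values used are the figures commonly quoted from Fisher's account of Darwin's experiment (in eighths of an inch) and should be treated as an assumption, although they do reproduce the average 20.93 and scale factor 9.75 given in the text. The grid limits and function names are likewise illustrative.

```python
import math

# Assumption: commonly quoted values of Darwin's 15 paired differences
DARWIN = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]

n = len(DARWIN)
ybar = sum(DARWIN) / n
s = math.sqrt(sum((y - ybar) ** 2 for y in DARWIN) / (n - 1))
se = s / math.sqrt(n)  # the scale factor s / sqrt(n)

def M(theta, q, y=DARWIN):
    """M(theta, q) = sum |y_i - theta|^q, as defined before eq. (11)."""
    return sum(abs(yi - theta) ** q for yi in y)

def posterior_theta_given_beta(beta, grid, y=DARWIN):
    """p(theta | beta, y) from eq. (13), normalized numerically on an equally spaced grid."""
    q = 2.0 / (1.0 + beta)
    logp = [-(len(y) / q) * math.log(M(t, q, y)) for t in grid]
    top = max(logp)  # subtract the maximum before exponentiating, for stability
    dens = [math.exp(lp - top) for lp in logp]
    step = grid[1] - grid[0]
    total = sum(dens) * step
    return [d / total for d in dens]

grid = [-40.0 + 0.1 * i for i in range(1201)]  # theta from -40 to 80
modes = {}
for beta in (-0.9, -0.5, 0.0, 0.5, 1.0):
    dens = posterior_theta_given_beta(beta, grid)
    modes[beta] = grid[dens.index(max(dens))]
    print(beta, round(modes[beta], 1))
```

As $\beta$ moves from near -1 towards 1, the center of the posterior moves from near the midrange of the data, through the average, to the median, which is one way of seeing the sensitivity discussed next.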

The performance characteristic here is not a single quantity but the whole posterior distribution from which all inferences about $\theta$ can be made.

Sensitivity of the inference to changes in $\beta$ is shown by the changes that occur in the posterior distributions $p(\theta \mid \beta, y)$ when $\beta$ is changed. Figure 2 shows these distributions for various values of $\beta$. Evidently, for this example, inferences are quite sensitive to changes in the parent density involving more or less kurtosis.*

ROBUSTIFICATION FOR KURTOSIS

As was earlier explained, high sensitivity alone does not necessarily produce lack of robustness. This depends also on how large are the discrepancies which are likely to occur, represented in (8) by the factor $p(\beta \mid y)$. It is convenient to define a function $p_u(\beta \mid y) = p(\beta \mid y)/p(\beta)$ which fills the role of a pseudo-likelihood and represents the contribution of information about $\beta$ coming from the data. This factor is the posterior distribution of $\beta$ when the prior is taken to be uniform. With this notation then for the present example

$$p(\theta \mid y) = \int_{-1}^{1} p(\theta \mid \beta, y)\, p_u(\beta \mid y)\, p(\beta)\, d\beta = \int_{-1}^{1} p_u(\theta, \beta \mid y)\, p(\beta)\, d\beta. \qquad (14)$$

For Darwin's data the distributions $p_u(\theta, \beta \mid y)$ and a particular $p(\beta)$ are shown in Figure 3. Figure 4 shows

* Notice however, the distinction that must be drawn between criterion and inference robustness [6], [14]. For example, for these data the significance level of the t criterion is changed hardly at all (from 2.485% to 2.388%) if we suppose the parent distribution is rectangular rather than normal.


P (O.ely) 

u * 


"iquro  3.  Joint  postorioi  distribution  p (P#6|y)  with 
a particular  prior  p(fl).  Darwin's  data. 

■ !'  ! V ) for  various  choices  of  p(P)  while  Figure  5 shows 
co v responding  distribution  p(0|y). 

(a) Making p(β) a delta function at β = 0 corresponds
with the familiar absolute assumption of normality. It
results in distribution (a) in Figure 5 which is the
scaled t referred to earlier.

(b) This choice for p(β) is appropriate to a prior assumption
that although not all distributions are normal,
variations in kurtosis are such that the normal distribution
takes a central role. For this particular example
the resulting distribution (b) in Figure 5 is not very
different from the t distribution.

(c) Here, by making p(β) uniform the modifier or pseudo-likelihood
p_u(β | y) is explicitly produced which represents
the information about kurtosis coming from the
sample itself. For this extreme form of prior distribution,
distribution (c) in Figure 5 is somewhat changed
although not dramatically. The reason for this is that
the widely discrepant distributions in Figure 2 for negative
values of β are discounted by the information
coming from the data.

(d) This distribution is introduced to represent the kind of
prior ideas which, following Tukey, many current robustifiers
seem to have adopted. The resulting posterior
distribution is shown in Figure 5(d).

This example brings to our attention the potential
importance even for small samples of information coming from
the data about the parameters β. In general if we compute
the modifier p_u(β | y) from

$$p_u(\beta \mid y) = \frac{\int p(\theta, \beta \mid y)\, d\theta}{p(\beta)} \qquad (15)$$

then we can write

$$p(\theta \mid y) = \int p(\theta \mid \beta, y)\, p_u(\beta \mid y)\, p(\beta)\, d\beta . \qquad (16)$$

Now even when the sensitivity factor is high, that is
when p(θ | β, y) changes rapidly as β changes, this will
lead to no uncertainty about p(θ | y) if p(β | y) is sharp.
This can be so either if p(β) is sharp - there is an
absolute assumption that we know β a priori - or if p_u(β | y)
is sharp. In the common situation, the spread of p_u(β | y)
will be proportional to 1/√n and for sufficiently large
samples there will be a great deal of information from the
sample about the relevant discrepancy parameters β. For
small samples however this is not generally so. This amounts
to saying that for sufficiently large samples it is always
possible to check assumptions and in principle to robustify
by incorporating sample information about discrepancies β
into our statistical procedure. For small samples we are
always much more at the mercy of the accuracy of prior
information whether we incorporate it by using Bayes theorem
or not. Notice however, how a Bayes analysis can make use
of sample data which would have been neglected by a sampling
theory analysis. Comparison of Figures 2 and 5(c) makes
clear the profound effect that the sampling information
about β for only n = 15 observations has on the
inferential situation. This sample information represented
by p_u(β | y) in Figure 4(c), although vague, is effective in
discounting the possibility of platykurtic distributions
which are the major cause of discrepancy in Figure 2. This
effect accounts for the very moderate changes that occur in
p(θ | y) accompanying the drastic changes made in p(β).
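The marginalization in (15) and (16) can be carried out numerically on a grid. The sketch below is an illustration under several assumptions that are not in the text: the data are Darwin's 15 paired differences as commonly quoted from Fisher's "The Design of Experiments", the joint density is obtained by integrating σ out of the likelihood (12) under the prior proportional to 1/σ with an assumed normalizing constant, and the β grid stops at -0.9 rather than -1 for numerical convenience.

```python
import math

# Darwin's paired differences as commonly quoted; NOT listed in this section.
DARWIN = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]

def log_c(beta):
    a = 1.0 + beta
    return (math.lgamma(1.5 * a) - math.lgamma(0.5 * a)) / a

def log_omega(beta):
    """Log of the assumed normalizing constant of the density (12)."""
    a = 1.0 + beta
    return 0.5 * math.lgamma(1.5 * a) - 1.5 * math.lgamma(0.5 * a) - math.log(a)

def log_joint(theta, beta, data):
    """log p_u(theta, beta | y) up to a constant, with sigma integrated out."""
    n, a = len(data), 1.0 + beta
    q = 2.0 / a
    log_m = math.log(sum(abs(y - theta) ** q for y in data))
    half = 0.5 * n * a
    return (n * log_omega(beta) + math.log(0.5 * a) + math.lgamma(half)
            - half * (log_c(beta) + log_m))

def trapezoid(values, step):
    """Trapezoidal rule on an equally spaced grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))

def marginal_theta(data, thetas, betas, prior):
    """Return p_u(beta | y) on the beta grid and p(theta | y) on the theta grid."""
    d_theta, d_beta = thetas[1] - thetas[0], betas[1] - betas[0]
    logs = [[log_joint(t, b, data) for t in thetas] for b in betas]
    top = max(max(row) for row in logs)
    joint = [[math.exp(v - top) for v in row] for row in logs]
    # equation (15): integrate theta out to get the pseudo-likelihood of beta
    p_u = [trapezoid(row, d_theta) for row in joint]
    norm_u = trapezoid(p_u, d_beta)
    p_u = [v / norm_u for v in p_u]
    # equation (16): weight the joint by the prior p(beta) and integrate beta out
    weighted = [[prior(b) * v for v in row] for b, row in zip(betas, joint)]
    p_theta = [trapezoid([row[j] for row in weighted], d_beta)
               for j in range(len(thetas))]
    norm = trapezoid(p_theta, d_theta)
    return p_u, [v / norm for v in p_theta]

thetas = [-40.0 + 0.5 * j for j in range(241)]    # grid from -40 to 80
betas = [-0.9 + 0.05 * i for i in range(39)]      # grid from -0.9 to 1.0
p_u, p_theta = marginal_theta(DARWIN, thetas, betas, prior=lambda b: 1.0)
post_mean = trapezoid([t * p for t, p in zip(thetas, p_theta)], 0.5)
print(round(post_mean, 2))
```

Replacing the `prior` argument by a function concentrated near β = 0 gives the analogue of choice (b); a uniform prior corresponds to choice (c).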

EXAMPLE 2. SERIAL CORRELATION AND REGRESSION ANALYSIS


Coen, Gomme and Kendall [9] gave 55 quarterly values of
the Financial Times ordinary share index y_t, U.K. car
production X_{1t} and Financial Times commodity index X_{2t}.
They related y_t to the lagged values X_{1,t-6} and X_{2,t-7}
by a regression equation

$$y_t = \theta_0 + \theta_1 t + \theta_2 X_{1,t-6} + \theta_3 X_{2,t-7} + e_t \qquad (17)$$

which they fitted by least squares. As mentioned earlier they
obtained estimates of θ_2 and θ_3 which were very highly
significantly different from zero and concluded that X_1
and X_2 could be used as "leading indicators" to predict
future share prices. Box and Newbold [10] pointed out that
if allowance is made for the serial correlation which exists
in the error e_t then the apparently significant effects
vanish and much better forecasts are obtained by using today's
price to forecast the future. This is a case where inferences
about θ are very non-robust to possible serial
correlation.

In a recent Wisconsin Ph.D. thesis [15] Lars Pallesen
reassessed the situation with a Bayesian analysis, supposing
that e_t may follow a first order autoregressive model
e_t - β e_{t-1} = a_t, where a_t is a white noise sequence as
earlier defined.

The dramatic shifts that occur in the posterior distributions
of θ_2 and θ_3 when it is not assumed a priori that
β = 0 are shown in Figures 6 and 7. The situation is
further illuminated by Figures 8 and 9 which show the joint
distribution of θ_2 and β and of θ_3 and β, together
with the marginal distribution p_u(β | y) based on noninformative
prior distributions.

Figure 6. Effect of different assumptions on posterior distribution
of θ_2. p(θ_2 | y) allows for possible
autocorrelation of errors; p(θ_2 | y, β = 0) does not.
θ_2 is regression coefficient of Share Index on
car sales lagged six quarters.

Figure 7. Effect of different assumptions on posterior distribution
of θ_3. p(θ_3 | y) allows for possible
autocorrelation of errors; p(θ_3 | y, β = 0) does not.
θ_3 is regression coefficient of Share Index on
Consumer Price Index lagged seven quarters.


EXAMPLE 3. OUTLIERS IN STANDARD STATISTICAL MODELS

Consider again the model form (1)

$$y_u = f(\xi_u, \theta) + e_u, \qquad u = 1, 2, \ldots, n . \qquad (18)$$

With standard assumptions about e_u and with the restriction that
the expectation function is linear in θ this is the widely
used Normal linear model. The remarkable thing about this
model is that it is ever seriously entertained even when
assumptions of independence and normality seem plausible. For
it specifically states that the model form is appropriate
for u = 1,2,...,n (that is, for every one of the experiments
run). Now anyone who has any experience of reality knows that
data are frequently affected by accidents of one kind or
another resulting in "bad values". In particular it is
expecting too much of any flesh and blood experimenter that
he could conduct experiments unerringly according to a
prearranged plan. Every now and again at some stage in the
generation of data, a mistake will be made which is unrecognized
at the time. Thus a much more realistic statement
would be that a model like (18) applied, not for u = 1,2,...,n,
but in a proportion 1 - α of the time and that during the
remaining proportion α of the time some discrepant, imprecisely
known, model was appropriate. Such a model was
proposed by Tukey in 1960 [16]. We call observations from
the first model "good" and those from the second model "bad".

This type of model was later used in a Bayesian context
by Box and Tiao [17]. They assumed that the discrepant model
which generated the bad values was of the same form as the
standard model except that the error standard deviation was
k times as large. The results are rather insensitive to the
choice of k. A Bayesian analysis was later carried out by
Abraham and Box [18] with a somewhat different version of the
model which assumes that the discrepant errors contain an
unknown bias δ.

Either approach yields results which are broadly similar
in that the posterior distribution of the parameters θ
appears as a weighted sum of distributions.

$$p(\theta \mid y) = w_0\, p_0(\theta \mid y) + \sum_{i=1}^{n} w_i\, p_i(\theta \mid y) + \sum_{i<j} w_{ij}\, p_{ij}(\theta \mid y) + \cdots \qquad (19)$$


The distribution p_0(θ | y) in the first term on the right
would be appropriate if all n observations were good, i.e.
generated from the central model. The distribution p_i(θ | y)
in the first summation allows the possibility that n - 1
observations are good but the ith is bad. The next summation
allows for two bad observations and so on. The weights
w are posterior probabilities, w_0 that no observation is
bad, w_i that only the ith observation is bad, w_ij that
the ith and jth observations are bad and so on.

Strictly the series includes all 2^n possibilities but
in practice terms after the first or second summation
usually become negligible.
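For the simplest case of a single location parameter, the weights in a mixture of the form (19) can be enumerated directly. The sketch below is not taken from [17]; it is a short derivation under stated assumptions: good values are N(θ, σ²), bad values are N(θ, k²σ²), each observation is bad independently with probability α, and the prior on (θ, σ) is proportional to 1/σ. Under those assumptions the weight of a configuration with r bad values is proportional to (α/(1-α))^r k^{-r} W^{-1/2} S^{-(n-1)/2}, where W and S are the sum of precision weights and the weighted residual sum of squares. The data vector is Darwin's paired differences as commonly quoted, which are not listed in this section.

```python
import math
from itertools import combinations

# Darwin's paired differences as commonly quoted; NOT listed in this section.
DARWIN = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]

def config_weight(data, bad, k, alpha):
    """Log unnormalized weight and conditional centre for one set of bad indices."""
    n = len(data)
    w = [1.0 / k ** 2 if i in bad else 1.0 for i in range(n)]   # precision weights
    total_w = sum(w)
    centre = sum(wi * y for wi, y in zip(w, data)) / total_w
    s = sum(wi * (y - centre) ** 2 for wi, y in zip(w, data))
    r = len(bad)
    log_wt = (r * math.log(alpha / (1.0 - alpha)) - r * math.log(k)
              - 0.5 * math.log(total_w) - 0.5 * (n - 1) * math.log(s))
    return log_wt, centre

def outlier_mixture(data, k=5.0, alpha=0.05):
    """Enumerate all 2^n good/bad configurations and normalize their weights."""
    n = len(data)
    terms = []
    for r in range(n + 1):
        for bad in combinations(range(n), r):
            log_wt, centre = config_weight(data, set(bad), k, alpha)
            terms.append((bad, log_wt, centre))
    top = max(t[1] for t in terms)
    total = sum(math.exp(t[1] - top) for t in terms)
    return [(bad, math.exp(lw - top) / total, centre) for bad, lw, centre in terms]

terms = outlier_mixture(DARWIN)
# marginal posterior probability that each observation is bad
p_bad = [sum(w for bad, w, _ in terms if i in bad) for i in range(len(DARWIN))]
# posterior mean of theta: weighted average of the conditional centres
mixture_mean = sum(w * centre for _, w, centre in terms)
print([round(p, 3) for p in p_bad], round(mixture_mean, 2))
```

Full enumeration is feasible here only because n = 15; for larger n one would truncate after the first or second summation as the text suggests.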

Figure 10 shows an analysis for the Darwin data mentioned
earlier. In this analysis it is supposed that the
error distribution for good values is N(0, σ²) and that for
bad values is N(0, k²σ²). The analysis is made, as before,
using a non-informative prior for θ and σ with k = 5 and
α = .05. This choice of α is equivalent to supposing that
with 20 observations there is a 63.2% chance that one or
more observations are bad. The probabilities of at least one
outlier for other choices of n and α are given below


98.0 


86.5 
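The quoted chance of at least one bad value can be checked with a short calculation. The sketch below compares the exact value 1 - (1 - α)^n under independent contamination with the approximation 1 - exp(-nα); which of the two the author used is not stated, so the comparison is offered only as a plausible reading. The function names are illustrative.

```python
import math

def prob_at_least_one_bad(n, alpha):
    """Exact probability when each of n observations is bad with probability alpha."""
    return 1.0 - (1.0 - alpha) ** n

def prob_at_least_one_bad_poisson(n, alpha):
    """Poisson approximation 1 - exp(-n * alpha)."""
    return 1.0 - math.exp(-n * alpha)

n, alpha = 20, 0.05
print(round(100 * prob_at_least_one_bad(n, alpha), 1))          # exact, in percent
print(round(100 * prob_at_least_one_bad_poisson(n, alpha), 1))  # approximation, in percent
```

For n = 20 and α = .05 the approximation gives 63.2%, matching the figure in the text, while the exact value is about 64.2%.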


1) The results [17] are very insensitive to the choice of
k but are less insensitive to the choice of α. However
it should be possible for the investigator to guess this
value of α reasonably well.

2) The calculation can be carried out for different α
values and the effect of different choices considered
[18].


Figure  10.  A.  Assuming  no  outliers, 

B.  Allowing  the  possibility  of  outliers. 

C.  Assuming  and  y^  are  outliers. 


Inspection of the weights w can also be informative
in indicating possible outliers. For example [19] the time
series shown in Figure 11 consists of 70 observations
generated from the model:

$$y_t = \phi y_{t-1} + \delta_t + a_t$$

where

$$\delta_t = \begin{cases} 5 & \text{if } t = 50 \\ 0 & \text{otherwise} \end{cases} \qquad (20)$$

φ = .5 and {a_t} a set of independent normally distributed
random variables with variance σ² = 1. The plot in Figure 12
of the weights w_t indicates the probability of each
observation being bad and clearly points to discrepancy of the 50th
observation.

Figure 11. A time series generated from a first order
autoregressive model with an outlier
innovation at t = 50.

Figure 12. Posterior probabilities of bad values
given that there is one bad value.
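A series of the kind described by model (20) is easy to generate. The sketch below is a minimal simulation, assuming a starting value of zero and a fixed random seed (neither is specified in the text); the one-step residual check at the end is a simple stand-in for illustration and is not the Bayesian weight calculation of [19].

```python
import random

def simulate_ar1_with_shock(n=70, phi=0.5, shock_time=50, shock_size=5.0,
                            innovations=None, seed=0):
    """Generate y_t = phi*y_{t-1} + delta_t + a_t for t = 1..n."""
    if innovations is None:
        rng = random.Random(seed)
        innovations = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y, prev = [], 0.0                     # starting value y_0 = 0 is an assumption
    for t in range(1, n + 1):
        delta = shock_size if t == shock_time else 0.0
        prev = phi * prev + delta + innovations[t - 1]
        y.append(prev)
    return y

def one_step_residuals(y, phi):
    """Residuals y_t - phi*y_{t-1} for t = 2..n (list index t - 2)."""
    return [y[t] - phi * y[t - 1] for t in range(1, len(y))]

series = simulate_ar1_with_shock()
residuals = one_step_residuals(series, 0.5)
largest = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print("largest residual at t =", largest + 2)
```

Because the shock enters through the innovation, its effect decays geometrically in the following observations rather than disappearing at once.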
Parsimony favors any device that expands model applicability
with small expenditure of additional parameters. As
was emphasized by Fisher, in suitable circumstances, parametric
transformation can provide such a device. For example
a suitable power transformation Y = y^λ can have profound
effect when y_max/y_min is not small.

In this application then the discrepancy parameter β
measures the need for transformation. In particular for the
power transformations no transformation corresponds to
β = λ = 1.

The Bayes approach to parametric transformations was
explored by Box and Cox [11]. One example they considered
concerned a 3 × 4 factorial design with 4 animals per cell
in which a total of n = 48 animals were exposed to three
different poisons and four different treatments. The
response was survival time. Since for this data
y_max/y_min = 12.4/1.8 ≈ 7 we know a priori that the effect
of needed transformation could be profound and it would
be sensible to make provision for it in the first tentative
model.
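How the data can speak about λ may be seen from a profile likelihood. The sketch below is a simplified single-sample illustration, not the factorial analysis of [11]: it uses the normalized power transformation (y^λ - 1)/λ with log y at λ = 0, and a small hypothetical data set whose logarithms are symmetric, so that the log scale should be preferred.

```python
import math

def power_transform(y, lam):
    """Normalized power transformation; log(y) is the limit at lam = 0."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def profile_loglik(data, lam):
    """Profile log-likelihood of lam for one normal sample, constants dropped."""
    n = len(data)
    z = [power_transform(y, lam) for y in data]
    mean = sum(z) / n
    s2 = sum((v - mean) ** 2 for v in z) / n
    # the last term is the log Jacobian of the transformation
    return -0.5 * n * math.log(s2) + (lam - 1.0) * sum(math.log(y) for y in data)

data = [math.exp(x) for x in (-2, -1, 0, 1, 2)]    # hypothetical positive responses
ratio = max(data) / min(data)                      # large ratio: transformation matters
grid = [i / 10 for i in range(-20, 21)]
best = max(grid, key=lambda lam: profile_loglik(data, lam))
print(round(ratio, 1), best)
```

When y_max/y_min is close to one, the profile is nearly flat in λ and the choice of transformation has little effect.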

For this particular set of data, where there is a blatant
need for transformation, an initial analysis with no transformation
followed by iterative fix up would be effective also.
Diagnostic checks involving residual plots of the kind suggested
by Anscombe and Tukey [20], [21] certainly indicate [22]
the dependence of cell variance on cell mean and less clearly
non-additivity. Whatever route we take we are led to
consider a transformation y^λ where λ approaches -1. As
will be seen from the analysis of variance below this transformation
not only eliminates any suggestion of an interaction
between poisons and treatments but also greatly increases
precision. This example seems to further illustrate how Bayesian
robustification of the model illuminates the relation of the data
to a spectrum of models. Using noninformative prior distributions
Figure 13 shows posterior distributions for λ with different
constraints applied to the basic normal, independent, model

where the subscripts r, c, i apply to rows, columns and replicates.


Analysis of Variance of the Biological Data

Figure 13. Posterior distributions for λ under various
constraints: N-Normality, H-Homogeneity of
variance, A-Additivity.


The nature of the various distributions is indicated in
the following table in which N, H, and A refer respectively
to Normality, Homogeneity of Variance and Additivity
and ρ_r and γ_c are row and column effects

    Distribution         Constraint
    p_u(λ | N, y)        V(e_rci) = σ²_rc
    p_u(λ | HN, y)       σ²_rc = σ²
    p_u(λ | AHN, y)      η_rc = μ + ρ_r + γ_c  and  σ²_rc = σ²


The disperse nature of the distribution p_u(λ | N, y) is to be
expected since a sample of size n = 48 cannot tell us much
about normality. The greater concentration of p_u(λ | HN, y)
arises because there is considerable variance heterogeneity
in the original metric which is corrected by strong
transformations in the neighborhood of the reciprocal.
Finally p_u(λ | AHN, y) is even more concentrated because
transformations of this type also remove possible non-additivity.
The analysis suggests among other things that for this data
the choices of transformations yielding approximate additivity,
homogeneity of variance and normality are not in
conflict. From Figure 14 (taken from [22]) we see that for
this example appropriate adjustment of the discrepancy parameter
β = λ affects not only the location of the poison main
effects, it also has a profound effect on their precision.*
Indeed the effect of including λ in estimating main effects
is equivalent to increasing the sample size by a factor of
almost three.


Figure 14. Posterior distributions for individual means
(poison main effects) on original and
reciprocal scale. Note greatly increased
precision induced by appropriate transformation.

PSYCHING OUT THE ROBUSTIFIERS

To apply Bayesian analysis we must choose a p(β) which
roughly describes the world we believe applies in the problem
context.

There are a number of ways we can be assisted in this
choice.

(i) We can look at extensive sets of data and build up suitable
distributions p(β) from experience.

(ii) We can consult experts who have handled a lot of data
of the kind being considered.

(iii) We can consider the nature of the robust estimates
proposed and what they reveal about the proposer's prior beliefs.

* Similar results are obtained for the treatment effects and
as noted before the transformation eliminates the need for
interaction parameters.

Consider, for example, the heavy tailed error problem.

Gina Chen in a recent Ph.D. thesis [23] has found prior
distributions p(β) yielding posterior means which approximate
robust estimates of location already proposed on other
grounds. In one part of her study she considers a model in
which data come from an exponential power distribution
p(y | θ, σ, β) of the form of (12) with probability 1 - α,
and, with probability α, from a similar distribution
but with a standard deviation k times as large.

Thus

$$p(y \mid \theta, \sigma, \alpha, \beta) = (1 - \alpha)\, p(y \mid \theta, \sigma, \beta) + \alpha\, p(y \mid \theta, k\sigma, \beta) . \qquad (22)$$

It turns out in fact that priors which put all the mass at
individual points in the (α, β) plane can very closely
approximate suggested M estimators as well as trimmed
means, Winsorized means, and other L estimators.
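The contaminated density (22) can be written down directly once the exponential power density is available. In the sketch below the normalizing constant of that density is an assumption (it is not given in the text), and the variance formula follows from the two components sharing the same mean; the function names are illustrative.

```python
import math

def exp_power_pdf(y, theta, sigma, beta):
    """Exponential power density with mean theta and standard deviation sigma.
    The normalizing constant is assumed, chosen so the density integrates to one."""
    a = 1.0 + beta
    c = math.exp((math.lgamma(1.5 * a) - math.lgamma(0.5 * a)) / a)
    omega = math.exp(0.5 * math.lgamma(1.5 * a) - 1.5 * math.lgamma(0.5 * a)) / a
    z = abs((y - theta) / sigma)
    return omega / sigma * math.exp(-c * z ** (2.0 / a))

def contaminated_pdf(y, theta, sigma, alpha, beta, k):
    """Equation (22): good values with probability 1 - alpha, bad with alpha."""
    return ((1.0 - alpha) * exp_power_pdf(y, theta, sigma, beta)
            + alpha * exp_power_pdf(y, theta, k * sigma, beta))

def contaminated_variance(sigma, alpha, k):
    """Variance of the mixture: sigma^2 * (1 - alpha + alpha * k^2)."""
    return sigma ** 2 * (1.0 - alpha + alpha * k ** 2)

print(contaminated_pdf(0.0, 0.0, 1.0, 0.05, 0.0, 5.0))
print(contaminated_variance(1.0, 0.05, 5.0))
```

With α = .05 and k = 5 the mixture variance is 2.2σ², so a small fraction of bad values more than doubles the variance of the parent distribution.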

Three  objectives  of  her  study  were 

(i)  To  make  it  possible  to  examine  more  closely  and  hence 
to  criticize  the  assumptions  about  the  real  world  which 
would  lead  to  the  various  robust  estimates. 

(ii)  To  compare  these  revealed  assumptions  with  the 
properties  of  actual  data. 

(iii) To allow conclusions obtained from simple problems to
be applied more generally. Once we agree on what p(β)
should be for a location parameter then the same p(β) can
be used for more complicated problems occurring in the same
scientific context. Direct application of Bayes theorem
can then, for example, indicate the appropriate analysis for
all linear and nonlinear models formerly analyzed by least
squares.

SUMMARY AND CONCLUSIONS

A major activity of statisticians should be to help the
scientist in his iterative search for useful but necessarily
inexact parsimonious models. While inexact models may
mislead, attempting to allow for every contingency a priori is
impractical. Thus models must be built by an iterative feedback
process in which an initial parsimonious model may be
modified when diagnostic checks applied to residuals indicate
the need.

When discrepancies may occur which are unlikely to be
detected by diagnostic checks, this feedback process could
fail and therefore procedures must be robustified with respect
to these particular kinds of discrepancies. This writer
believes that this may best be done by suitably modifying
the model rather than by modifying the method of inference.

In particular a Bayes approach offers many advantages.
Suppose the scientist wishes to protect inferences about
primary parameters θ from effects of discrepancy parameters
β. Bayes analysis automatically brings into the open a
number of important elements.

(i) The prior distribution p(β) reveals the nature of
the supposed universe of discrepancies from which the
procedure is being protected.

(ii) The distribution p_u(β | y) = p(β | y)/p(β) represents
information about β coming from the data itself. This
distribution may be inspected for concordance with p(β).

(iii) The conditional posterior distribution p(θ | β, y) shows
the sensitivity of inferences to choice of β.

(iv) From the marginal posterior distribution p(θ | y)
appropriate inferences which are robust with respect to β
may be made.

(v) Implications of inspired empiricism can lead to useful
models. For example, we can ask "What kind of p(β) will
make some empirical robust measure of location a Bayesian
estimator?" This p(β) may then be examined, criticized and
perhaps compared with distributions of β encountered in
the real world.

(vi) Once p(β) is agreed on then that same p(β) can be
applied to other problems. For example, we do not need to
give special consideration to robust regression, robust
analysis of variance, robust non-linear estimation. We
simply carry through the Bayesian analysis with the agreed
p(β).

(vii)  In  the  past  the  available  capacity  and  speed  of  computers 
might  have  limited  this  approach  but  this  is  no  longer  true. 
It  will  be  necessary  however,  to  make  a major  effort  to 
produce  suitable  programs  which  can  readily  perform  analyses 
and  display  results  of  the  kind  exemplified  in  this  paper. 


APPENDIX  1 


Suppose that, in model (1), n observations are available
and standard assumptions of independence and homoscedasticity
are made about the errors {e_u}. Suppose finally that the
object is to estimate E(y) over a region in the space of ξ
"covered" by the experiments {ξ_u}. Then the number of parameters
p employed in the expectation function is a natural
measure of prodigality and its reciprocal 1/p of parsimony.

Now denote by

$$\hat{y}_u^{(p)} = f^{(p)}(\xi_u, \hat{\theta}_p)$$

a fitted value with estimates θ̂_p obtained by least squares and by

$$\bar{V}^{(p)} = \sum_u V\{\hat{y}_u^{(p)}\}/n \qquad (A.1)$$

the average variance of the n fitted values.

It is well known that (exactly if the expectation function
is linear in θ, and in favorable circumstances,
approximately otherwise) no matter what experimental design
{ξ_u} is used

$$\bar{V}^{(p)} = p\sigma^2/n . \qquad (A.2)$$

Now if the {ξ_u} can be regarded as a sampling of the function
over the region of interest, then V̄^{(p)} provides a
measure of average variance of estimate of the function over
the experimental region.
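Result (A.2) can be checked numerically for a linear model, since the variance of the uth fitted value is σ² times the uth diagonal element of the hat matrix and those elements sum to p. The sketch below uses a hypothetical design of eight equally spaced points and polynomial models of increasing order; the helper names are illustrative.

```python
def orthonormal_columns(columns):
    """Modified Gram-Schmidt on a list of linearly independent column vectors."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    return basis

def fitted_value_variances(columns, sigma2=1.0):
    """V(yhat_u) = sigma^2 * h_uu, where h_uu is the uth hat-matrix diagonal."""
    basis = orthonormal_columns(columns)
    n = len(columns[0])
    return [sigma2 * sum(q[u] ** 2 for q in basis) for u in range(n)]

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]       # hypothetical design points
for p in (1, 2, 3):
    design = [[xi ** j for xi in x] for j in range(p)]   # columns 1, x, x^2, ...
    variances = fitted_value_variances(design)
    print(p, sum(variances) / len(x))                # average variance, p/n for sigma2 = 1
```

The individual variances differ from point to point, being largest at the ends of the design, but their average depends only on p and n.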

Equation (A.2) says that this average variance of
estimate of the function is proportional to the prodigality p.
Alternatively it is reasonable to regard I^{(p)} = {V̄^{(p)}}^{-1}
as a measure of information supplied by the experiment about
the function and

$$I^{(p)} = n/(p\sigma^2) . \qquad (A.3)$$

Thus this measure of information is proportional to the
parsimony 1/p. For example, if the expectation function
needed as many parameters as there were observations so that
p = n then ŷ_u = y_u and

$$\bar{V}^{(n)} = V(y_u) = \sigma^2, \qquad I^{(n)} = 1/\sigma^2 . \qquad (A.4)$$

In this case the model does not summarize information and
does not help in reducing the variance of estimate of the
function.

At the other extreme if the model needed to contain only
a single parameter, for example,

$$y_u = \theta + e_u \qquad (A.5)$$

then θ̂ = ŷ_u = ȳ and

$$\bar{V}^{(1)} = V(\bar{y}) = \sigma^2/n, \qquad I^{(1)} = n/\sigma^2 . \qquad (A.6)$$

In this case the use of the model results in considerable
summarizing of information and reduces the variance of estimate
of the function n times or equivalently increases the
information measure n-fold.

Considerations of this sort weigh heavily against
unnecessarily complicated models.

As an example of unnecessary complication consider an
experimenter who wished to model the deviation y_t from its
mean of the output from a stirred mixing tank in terms of
the deviation ξ_t from its mean of input feed concentration.

If data were available at equal intervals of time, he might
use a model

$$y_t = \sum_{q=0}^{k} \theta_q\, \xi_{t-q} + e_t \qquad (A.7)$$

in which k was taken sufficiently large so that deviations
in input for q > k were assumed to have negligible
effect on the output at time t. This model contains k + 1
parameters θ which need to be estimated. Alternatively
if he knew something about the theory of mixing he might
instead tentatively entertain the model

$$y_t = \theta_0 (\xi_t + \theta_1 \xi_{t-1} + \theta_1^2 \xi_{t-2} + \cdots) + e_t \qquad (A.8)$$

or equivalently


which contains only two parameters θ.

Thus if the simpler model provided a fair approximation
it could result in greatly increased precision as well as
understanding.

APPENDIX 2

The practical importance of worrying about the right
things is illustrated, for example, by the entries in the
following table taken from [7], [22]. This shows the result
of a sampling experiment designed to compare the robustness
to non-normality and to serial correlation of the significance
level of the t test and the non-parametric
Mann-Whitney test. One thousand pairs of samples of ten of
independent random variables u_t were drawn from a rectangular
distribution, a normal distribution and a highly skewed
distribution (a χ² with 4 degrees of freedom) all adjusted
to have mean zero. In the first row of the table the errors
e_t = u_t were independently distributed, in the second and
third rows a moving average model e_t = u_t - θ u_{t-1} was used
to generate errors with serial correlation -0.4 and +0.4
respectively. The numbers on the right show the corresponding
results when the pairs of samples were randomized.
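The serial correlations quoted for the moving average errors can be related to the parameter θ. For e_t = u_t - θ u_{t-1} with independent u_t the lag-one autocorrelation is -θ/(1 + θ²), so values of -0.4 and +0.4 would correspond to θ = 0.5 and θ = -0.5; these θ values are an inference and are not stated in the text. The sketch below checks the formula against a simulated series with rectangular u_t, using an arbitrary seed and series length.

```python
import random

def ma1_lag1_autocorrelation(theta):
    """Theoretical lag-1 autocorrelation of e_t = u_t - theta * u_{t-1}."""
    return -theta / (1.0 + theta * theta)

def simulate_ma1(n, theta, seed=0):
    """MA(1) errors built from rectangular (uniform) u_t with mean zero."""
    rng = random.Random(seed)
    u = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]
    return [u[t] - theta * u[t - 1] for t in range(1, n + 1)]

def sample_lag1_autocorrelation(e):
    """Usual sample autocorrelation at lag one."""
    n = len(e)
    mean = sum(e) / n
    num = sum((e[t] - mean) * (e[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in e)
    return num / den

for theta in (0.5, -0.5):
    errors = simulate_ma1(20000, theta)
    print(theta, ma1_lag1_autocorrelation(theta),
          round(sample_lag1_autocorrelation(errors), 3))
```

The autocorrelation does not depend on the shape of the distribution of u_t, which is why the same moving average scheme can be applied to the rectangular, normal and skewed parents alike.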

In this example the performance characteristic studied
is the number of samples showing significance at the 5%
level when the null hypothesis of equality of means was in
fact true. Under ideal assumptions the number observed
would, of course, vary about the expected value of 50 with a
sampling standard deviation of about 7. It is not intended
to suggest by this example that the performance of significance
tests when the null hypothesis is true is the most
important thing to be concerned about. But rightly or
wrongly, designers of non-parametric tests have been concerned
about it, and demonstrations of this kind suggest that their
labors are to some extent misdirected. In this example it is
evident that it is the physical act of randomization and much
less so the choice of criterion that protects the signifi-


Tests  of  Two  Samples  of  Ten  Observations  Having  the  S 
Frequency  in  1,000  Trials  of  Significance  at  the  5 
Level  Using  the  t-Test  (t.)  and  the  Mann-Whitney 